Goto

Collaborating Authors

 training period




0cddb7c06f1cd518e1efdc0e20b70c31-Supplemental.pdf

Neural Information Processing Systems

The 2048 mesh point high-resolution solution is generated using the Fourier-Galerkin spectral method [41] with the 4th order Runge-Kutta method for time stepping. From the box-filtered initial condition, the 32-point low-resolution solution is conducted using central differencing for the spatial derivatives. This choice does not introduce additional artificial viscosity; thus, the solution without closure is naturally unstable. In this setting, u can be regarded as fully resolved, thus the numerical residual r(u), defined in Eq.(10),iszero. Forthereacting flow simulation task, the different variables in the input q are normalized to the same order of magnitude.




Predictive AI with External Knowledge Infusion for Stocks

Dukkipati, Ambedkar, Mayilvaghanan, Kawin, Pallekonda, Naveen Kumar, Hadnoor, Sai Prakash, Ayyagari, Ranga Shaarad

arXiv.org Artificial Intelligence

Fluctuations in stock prices are influenced by a complex interplay of factors that go beyond mere historical data. These factors, themselves influenced by external forces, encompass inter-stock dynamics, broader economic factors, various government policy decisions, outbreaks of wars, etc. Furthermore, all of these factors are dynamic and exhibit changes over time. In this paper, for the first time, we tackle the forecasting problem under external influence by proposing learning mechanisms that not only learn from historical trends but also incorporate external knowledge from temporal knowledge graphs. Since there are no such datasets or temporal knowledge graphs available, we study this problem with stock market data, and we construct comprehensive temporal knowledge graph datasets. In our proposed approach, we model relations on external temporal knowledge graphs as events of a Hawkes process on graphs. With extensive experiments, we show that learned dynamic representations effectively rank stocks based on returns across multiple holding periods, outperforming related baselines on relevant metrics.


Efficient fine-tuning of 37-level GraphCast with the Canadian global deterministic analysis

Subich, Christopher

arXiv.org Artificial Intelligence

This work describes a process for efficiently fine-tuning the GraphCast data-driven forecast model to simulate another analysis system, here the Global Deterministic Prediction System (GDPS) of Environment and Climate Change Canada (ECCC). Using two years of training data (July 2019 -- December 2021) and 37 GPU-days of computation to tune the 37-level, quarter-degree version of GraphCast, the resulting model significantly outperforms both the unmodified GraphCast and operational forecast, showing significant forecast skill in the troposphere over lead times from 1 to 10 days. This fine-tuning is accomplished through abbreviating DeepMind's original training curriculum for GraphCast, relying on a shorter single-step forecast stage to accomplish the bulk of the adaptation work and consolidating the autoregressive stages into separate 12hr, 1d, 2d, and 3d stages with larger learning rates. Additionally, training over 3d forecasts is split into two sub-steps to conserve host memory while maintaining a strong correlation with training over the full period.


Temporal Scaling Law for Large Language Models

Xiong, Yizhe, Chen, Xiansheng, Ye, Xin, Chen, Hui, Lin, Zijia, Lian, Haoran, Su, Zhenpeng, Niu, Jianwei, Ding, Guiguang

arXiv.org Artificial Intelligence

Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pre-training process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters \textit{directly} on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pre-training dynamics from the token position granularity provides some insights to enhance the understanding of LLM pre-training.


Evaluating Large Language Models for Generalization and Robustness via Data Compression

Li, Yucheng, Guo, Yunhao, Guerin, Frank, Lin, Chenghua

arXiv.org Artificial Intelligence

Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address this, we propose a lossless data compression based evaluation approach that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models' training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing period as a measure of robustness. Our experiments test 14 representative large language models with various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression rate of many models reduces significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find the context size and tokenization implementation have a big impact of on the overall compression performance.


Applications of Sequential Learning for Medical Image Classification

Naim, Sohaib, Caffo, Brian, Sair, Haris I, Jones, Craig K

arXiv.org Artificial Intelligence

Purpose: The aim of this work is to develop a neural network training framework for continual training of small amounts of medical imaging data and create heuristics to assess training in the absence of a hold-out validation or test set. Materials and Methods: We formulated a retrospective sequential learning approach that would train and consistently update a model on mini-batches of medical images over time. We address problems that impede sequential learning such as overfitting, catastrophic forgetting, and concept drift through PyTorch convolutional neural networks (CNN) and publicly available Medical MNIST and NIH Chest X-Ray imaging datasets. We begin by comparing two methods for a sequentially trained CNN with and without base pre-training. We then transition to two methods of unique training and validation data recruitment to estimate full information extraction without overfitting. Lastly, we consider an example of real-life data that shows how our approach would see mainstream research implementation. Results: For the first experiment, both approaches successfully reach a ~95% accuracy threshold, although the short pre-training step enables sequential accuracy to plateau in fewer steps. The second experiment comparing two methods showed better performance with the second method which crosses the ~90% accuracy threshold much sooner. The final experiment showed a slight advantage with a pre-training step that allows the CNN to cross ~60% threshold much sooner than without pre-training. Conclusion: We have displayed sequential learning as a serviceable multi-classification technique statistically comparable to traditional CNNs that can acquire data in small increments feasible for clinically realistic scenarios.